November 2025
Generative AI tools in low- and middle-income countries are multiplying
Problem: While some studies show effectiveness (e.g., Henkel et al., 2024), others show AI applications exhibit unexpected and unwanted behavior that can be harmful to users (e.g., Bastani et al., 2024)
Gap: While there is broad consensus on the importance of evaluating GenAI in the development sector, there has been little agreement on what this actually means
Consequence: In the absence of clear standards, organizations have adopted very different evaluation approaches
Tech-focused organizations: Emphasize model/product performance, neglect development outcomes
Development-focused organizations: Default to RCTs, ignore model and product evaluations
Funders: Lack clarity on what evaluations to expect or what “right-sized” evaluation entails
Reality: All methods are complementary and should be used together at different stages
This playbook organizes evaluation around four levels:
Level 1 – Model evaluation: Does the AI model produce the desired responses?
Level 2 – Product evaluation: Does the product facilitate meaningful interactions?
Level 3 – User evaluation: Does the product positively support users’ thoughts, feelings, knowledge, and behaviors?
Level 4 – Impact evaluation: Does access to the product improve development outcomes?
Unlike earlier rule-based digital tools, GenAI’s unique sensitivity to the underlying model, architecture, data, and prompts demands new evaluation methods
The underlying components can evolve far faster than in earlier digital tools - with new models and capabilities released weekly
Developers must ensure their applications perform as intended over time, even as updates are released
Continuous evaluation thus becomes essential, enabling developers to catch regressions and verify performance as components change
This focus on continuous evaluation, while commonplace in software companies, might be less familiar in the development sector where programs are often judged by one-off experiments
Development sector: Programs often judged by one-off experiments (evaluation as finish line)
Our approach: Rapid, ongoing cycles where deployment, adaptation, evaluation, and improvement happen in tandem
Tech sector: Typically stops at Levels 1-2 (engagement predicts success)
Development sector: Higher bar - does it improve lives in meaningful, measurable, cost-effective ways?
Key: All actors must see beyond their slice of the evaluation process
Four concrete, actionable steps that move evaluation from theory into practice:
A user funnel is a structured way to map how individuals move through your product or program, from first exposure to long-term life impact. A comprehensive funnel creates a shared framework for tracking a user’s experience through a journey.
To build a robust funnel, teams should begin by defining the final development outcome they’re targeting (Level 4) - for instance, improved learning outcomes, better health, or increased crop yields. From there, work backward to break down the journey into specific user stages.
| Stage | Description | Evaluation Level |
|---|---|---|
| 1. Recruitment | Beneficiary identified and enters program | Level 2 |
| 2. Onboarding | User introduced to AI product and completes setup | Level 2 |
| 3. Engagement | User begins actively interacting with AI product | Level 2 |
| 4. Retention | User continues engaging over time (not dropping off) | Level 2 |
| 5. Proximal Outcome | Near-term cognitive or behavioral change | Level 3 |
| 6. Development Outcome | Long-term desired result achieved | Level 4 |
For each stage, teams should clearly define:
| Element | Description |
|---|---|
| What program does | Actions to bring users into that stage |
| What user must do | User actions that count as entering the stage |
| Metric | Measurement that confirms entry (e.g., login rate, session length) |
| Target values | Target metric values and transition rates between stages |
| Costs | Costs associated with moving a user through the stage |
| DRIs | Directly Responsible Individuals for performance and metrics |
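The stage counts and transition rates described above can be tallied with a few lines of code; this is a sketch only, and the stage names and counts below are hypothetical:

```python
# Illustrative funnel report: per-stage conversion rates and drop-offs.
# Stage names mirror the funnel table; the counts are made up.
STAGES = ["recruitment", "onboarding", "engagement",
          "retention", "proximal_outcome", "development_outcome"]

def funnel_report(counts: dict[str, int]) -> list[dict]:
    """For each stage transition, return the conversion rate and drop-off."""
    report = []
    for prev, curr in zip(STAGES, STAGES[1:]):
        rate = counts[curr] / counts[prev] if counts[prev] else 0.0
        report.append({
            "transition": f"{prev} -> {curr}",
            "conversion_rate": round(rate, 3),
            "dropped": counts[prev] - counts[curr],
        })
    return report

counts = {"recruitment": 1000, "onboarding": 640, "engagement": 480,
          "retention": 210, "proximal_outcome": 150, "development_outcome": 90}
for row in funnel_report(counts):
    print(row)
```

The largest `dropped` value is a natural starting point for the hypothesis-driven diagnosis described later in this playbook.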
A well-designed evaluation framework is only as good as the data infrastructure that supports it. At the heart of that infrastructure is a robust ETL pipeline - a system that extracts, transforms, and loads data to power consistent, reliable measurement of program indicators.
Extract: Collect data from various sources - chat logs, product telemetry, survey tools, third-party APIs, or spreadsheets
Transform: Clean, standardize, and reshape the raw data into a usable format. This could involve timestamp alignment, anonymization, session stitching, or deriving new metrics like time-on-task indicators
Load: Store the transformed data in a centralized system (like a data warehouse or analytics dashboard) where teams can access it for analysis, visualization, or modeling
Why critical: AI products, especially those using generative models, produce high volumes of complex, often unstructured data: prompts, outputs, clicks, feedback, engagement patterns, and more. Without a clear ETL pipeline, turning raw data into actionable metrics at scale becomes unreliable and slow.
Example: A product designed to support adolescent mental health might collect model-level outputs (Level 1), engagement logs (Level 2), behavioral indicators (Level 3), and outcome data (Level 4) - all requiring integration through a robust pipeline.
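A minimal extract-transform-load sketch, using stdlib SQLite as a stand-in for a real warehouse and an in-memory list as a stand-in for real chat logs (record fields are illustrative):

```python
# Minimal ETL sketch (assumed record shapes; adapt to your own sources).
import hashlib
import sqlite3

def extract(raw_logs):
    """Extract: yield raw chat-log records (here, from an in-memory list;
    in practice, from files, telemetry APIs, or survey exports)."""
    yield from raw_logs

def transform(record):
    """Transform: anonymize the user ID and derive session-level metrics."""
    return {
        "user": hashlib.sha256(record["user_id"].encode()).hexdigest()[:12],
        "session_seconds": record["end_ts"] - record["start_ts"],
        "turns": len(record["messages"]),
    }

def load(rows, conn):
    """Load: write transformed rows into a central store (SQLite here)."""
    conn.execute("CREATE TABLE IF NOT EXISTS sessions "
                 "(user TEXT, session_seconds INTEGER, turns INTEGER)")
    conn.executemany(
        "INSERT INTO sessions VALUES (:user, :session_seconds, :turns)", rows)

raw = [{"user_id": "amina", "start_ts": 100, "end_ts": 460,
        "messages": ["hi", "reply", "thanks"]}]
conn = sqlite3.connect(":memory:")
load([transform(r) for r in extract(raw)], conn)
print(conn.execute("SELECT session_seconds, turns FROM sessions").fetchone())
# -> (360, 3)
```

In a real pipeline each step would be scheduled and monitored, but the extract/transform/load separation stays the same.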
Once a user funnel is in place and metrics are flowing through a robust ETL pipeline, the next challenge is understanding why certain funnel metrics are underperforming.
Process:
Identify drop-offs: Start by identifying major user drop-offs along the funnel
Develop hypotheses: Pose specific, testable questions: Why are users stalling? What mechanism explains this?
Surface competing hypotheses: For example, if engagement dips after onboarding: Is value proposition unclear? Are users overwhelmed? Do they mistrust the AI?
Test hypotheses: Each hypothesis becomes a lens for focused measurement or experiments
Goal: Make evaluation generative - helping teams ask better questions, faster. This approach sits at the intersection of product management, UX research, and behavioral science.
Test hypotheses through experimentation:
Key: Match experimentation to product maturity, hypothesis scale, and decision stakes. Tools like Evidential help teams automate randomization and track results.
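As one hedged sketch of such an experiment's analysis, a two-proportion z-test comparing onboarding completion across arms (all figures are illustrative; real analyses should pre-register metrics and thresholds):

```python
# Two-proportion z-test for a simple A/B experiment, e.g. does a revised
# onboarding flow raise completion? Numbers below are invented.
from math import sqrt, erf

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p) comparing completion rates of arms A and B."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF, via math.erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_z(success_a=320, n_a=1000, success_b=380, n_b=1000)
print(f"z={z:.2f}, p={p:.4f}")
```

Dedicated experimentation tooling adds randomization, guardrails, and sequential monitoring on top of this basic comparison.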
Evaluation is a team sport - no single role covers all four levels.
| Level | Lead Roles | Support Roles |
|---|---|---|
| Level 1 | AI Engineers, ML Researchers | Domain Experts, Product Owners |
| Level 2 | Product Managers | Data Scientists, Data Engineers |
| Level 3 | Psychologists, UX Researchers | Data Scientists |
| Level 4 | Policy Researchers, Economists | AI Engineers |
Question: Does the AI model produce the desired responses?
Why important: AI models, especially large language models (LLMs) and related foundational models, do not “understand” content in the way humans do. Instead, they generate outputs by predicting the next word in a sequence based on statistical patterns in their training data. Because of this, models can hallucinate or appear fluent and convincing while still being inaccurate, biased, irrelevant, or even harmful.
This makes structured model evaluation essential. We need to systematically and rigorously assess whether an AI system consistently meets conditions such as usefulness, accuracy, appropriateness, and safety across diverse tasks and user contexts. This is especially critical when AI tools are deployed in sensitive domains like education, health, or agriculture, where misinformation or misalignment can cause real harm.
Beyond ensuring safety, developers must also evaluate whether their AI systems exhibit desirable behaviors and characteristics proven to have real-world impact. For instance, an AI tutor should follow pedagogical best practices - such as withholding answers to encourage self-directed learning and accurately gauging a student’s level to tailor instruction.
Most Generative AI applications are built on foundational models like those from OpenAI (GPT), Anthropic (Claude), Google (Gemini), or Meta (Llama). However, your application is a full system, not just the foundational model. It includes many other components that can be grouped into three buckets:
Pre-processing: Before handing off the input from the user to the LLM, you may wish to transform it into a format suitable for the LLM. Examples include: sanitizing or filtering language; converting speech to text; paraphrasing the user’s request; translation from a low-resource language to a high-resource one.
LLM context preparation: An LLM takes three things as input: the “prompt” or system instructions, the user’s input after being pre-processed, and a “context” which can include past conversation history, relevant content retrieved from your knowledge base, or even tools available for the LLM to call.
Post-processing: Before returning the output to the user, you may also wish to transform it into the correct format and check the output using safety or quality guardrails. Examples include: hallucination checker, converting text to speech, translation to the user’s preferred language.
Example: An AI agronomist in Senegal, answering questions from farmers in Pulaar, might: (a) check input for malicious intent, (b) translate from Pulaar to English, (c) retrieve relevant content from database, (d) retrieve information about the farmer, (e) generate an answer, (f) check that the answer is grounded, (g) translate back to Pulaar. Model evaluations cover this entire pipeline.
| Role | Responsibility |
|---|---|
| AI Engineers, ML Researchers | Execute - Lead model evaluation process |
| Domain Experts, Product Owners | Support - Define evaluation rubrics |
Responsible: Product Owners and Domain Experts (with Engineering support)
Question: “What characteristics should our AI solution embody?” These are qualitative goals (e.g., “Trustworthy”, “On-Brand”, “Concise”). Most of the rubric will be determined by your use case, context, and impact goals. This step requires reflection and discussion with stakeholders - it is critical and guides the rest of your evaluation steps.
| Organization | Product | Rubric Items |
|---|---|---|
| Jacaranda Health | PROMPTS: Maternal health SMS service (Swahili/English) | Medical accuracy, personability, simplicity (Stanford Center for Digital Health, 2025) |
| Digital Green | Farmer.Chat: Agricultural advice platform (40+ crops, 4 countries) | Faithfulness, relevance, accessibility (Singh et al., 2024) |
Recommendation: Restrict to ~5 items. Longer lists = more expensive and difficult. Tradeoffs exist (e.g., concise vs. complete, friendly vs. direct).
Responsible: Engineering (with Product Owner validation)
Engineering translates qualitative rubric into quantitative metrics (e.g., “Trustworthy” → “Factual Consistency Score”). Product Owner validates that technical metrics are acceptable proxies for business goals.
Terminology: Rubric item → Metric → Scorer → Score
Example: “helpful” → “answer relevance” → “LLM-as-judge” → “4 out of 5”
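The chain above can be sketched in code; `call_llm` here is a stand-in for whatever model client you use (OpenAI, Anthropic, etc.), stubbed so the sketch runs offline:

```python
# Sketch of rubric item -> metric -> scorer -> score with an LLM-as-judge
# scorer. The prompt wording and score format are illustrative only.
import re

JUDGE_PROMPT = (
    "You are grading an AI tutor's reply for the rubric item '{item}'.\n"
    "Question: {question}\nReply: {reply}\n"
    "Respond with exactly: Score: <1-5>"
)

def parse_score(judge_output: str) -> int:
    """Extract the 1-5 score from the judge's reply; raise if malformed."""
    match = re.search(r"Score:\s*([1-5])", judge_output)
    if not match:
        raise ValueError(f"Unparseable judge output: {judge_output!r}")
    return int(match.group(1))

def score_reply(item, question, reply, call_llm):
    prompt = JUDGE_PROMPT.format(item=item, question=question, reply=reply)
    return parse_score(call_llm(prompt))

# Offline stub standing in for a real LLM call.
stub_llm = lambda prompt: "Score: 4"
print(score_reply("helpful", "What is crop rotation?", "It is...", stub_llm))
# -> 4
```

Note the defensive parsing: LLM judges occasionally return malformed output, and silent parse failures corrupt metric trends.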
| Category | Examples | Speed | Accuracy | Cost | Best For |
|---|---|---|---|---|---|
| Statistical | BLEU, ROUGE, METEOR, WER | +++++ | ++ | + | Specific tasks |
| Model-based | AlignScore/LIM-RA, BLEURT, BARTScore, COMET | +++ | +++ | +++ | Domain-specific tasks |
| LLM-as-judge | G-Eval, RARR | ++ | +++++ | +++++ | Flexible evaluation |
| Human evaluation | Human evaluation | + | +++++ | +++++ | Calibration & QA |
Ideal: Combination of methods. Human evaluation’s primary role: create “answer key” to calibrate automated scorers and final QA. Note: Human evaluation has its own biases.
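To make the "Statistical" row concrete, here is a from-scratch ROUGE-1 F1 scorer (unigram overlap); production systems would normally use a maintained library instead:

```python
# ROUGE-1 F1: harmonic mean of unigram precision and recall between a
# candidate answer and a reference answer. Example strings are invented.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # matched unigram count
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("plant maize in june", "plant maize in early june"))
```

Such scorers are cheap enough to run on every commit, which is why they anchor the low-cost tier of automated evaluation.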
Responsible: Product Owner (with Domain Experts and Engineering support)
Product Owner ensures quality, scope, and representativeness. Domain Experts author ideal answers. Engineering provides technical support.
| Source | When to Use | Notes |
|---|---|---|
| Past transaction data | Adding AI to existing application | Extract question-answer pairs from human-answered queries |
| Human-annotated data | Building new AI offering | Generate questions + expert answers. Warning: Don’t use LLM to generate answers for experts to verify - correcting is harder than creating |
| Customize public datasets | High-quality public dataset exists | Subset and augment to match your context |
A good dataset should cover the range of topics, user types, and edge cases your product will encounter in production
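A golden dataset is often stored as JSONL; the record shape and field names below are illustrative, not a standard:

```python
# Hypothetical golden-dataset loader/validator: each JSONL line pairs a
# question with an expert-authored ideal answer and its rubric tags.
import json

REQUIRED = {"question", "ideal_answer", "rubric_items"}

def validate_golden(lines):
    """Parse JSONL lines, rejecting records missing required fields."""
    dataset = []
    for i, line in enumerate(lines, start=1):
        record = json.loads(line)
        missing = REQUIRED - record.keys()
        if missing:
            raise ValueError(f"line {i}: missing fields {sorted(missing)}")
        dataset.append(record)
    return dataset

sample = [
    '{"question": "When should I plant maize?", '
    '"ideal_answer": "After the first rains...", '
    '"rubric_items": ["accuracy", "simplicity"]}',
]
print(len(validate_golden(sample)))  # -> 1
```

Validating the dataset at load time catches annotation errors before they silently skew every downstream metric.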
Responsible: Engineering (with Product Owner support)
Automate evaluations and integrate into CI/CD pipeline. Product Owner monitors performance trends over time.
| Eval Type | Examples | Frequency | When |
|---|---|---|---|
| Low-Cost | Statistical scorers (ROUGE), model-based | Every commit | Fast feedback, limited scope |
| High-Cost | LLM-as-judge scorers | Nightly/weekly/before release | Comprehensive but expensive |
Tracking: Use observability tools (Logfire, Helicone, Langfuse) to plot metric scores over time. Dashboard helps track progress against rubric goals.
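A minimal CI gate over such a pipeline might look like the following (metric names and thresholds are placeholders):

```python
# CI gate sketch: fail the build when any metric falls below its target.
# Metric names and thresholds are illustrative, not prescribed values.
THRESHOLDS = {"answer_relevance": 0.80, "faithfulness": 0.90}

def gate(metric_scores: dict[str, float]) -> list[str]:
    """Return the metrics below threshold (empty list = build passes)."""
    return [m for m, target in THRESHOLDS.items()
            if metric_scores.get(m, 0.0) < target]

failures = gate({"answer_relevance": 0.84, "faithfulness": 0.87})
if failures:
    print(f"FAIL: below threshold: {failures}")  # exit non-zero in real CI
else:
    print("PASS")
```

Treating a missing metric as 0.0 (rather than skipping it) is deliberate: a scorer that silently stops reporting should fail the build, not pass it.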
Responsible: Engineering (with Product Owner support)
Initial scores will reveal areas for improvement. Results are a diagnostic tool, not a final grade. Engineering analyzes results and diagnoses root causes. Product Owner prioritizes refinement work.
| Step | Action | Purpose |
|---|---|---|
| 1. Isolate Problem | Identify which component is failing | Modern AI has many components (retrieval, prompting, model params) |
| 2. Use Traces | Inspect inputs/outputs of each component | Pinpoint root cause (e.g., ineffective retrieval, poor prompt) |
| 3. Unit Tests | Implement component-level tests | Validate specific logic, catch regressions early |
Goal: Turn evaluation into a process that helps teams ask better questions and improve iteratively.
Responsible: Product Owner (with Engineering, QA, or Security/Ethics Team support)
Product Owner ensures red-teaming is conducted and prioritizes remediation. Technical teams execute adversarial testing.
What is red-teaming? Structured adversarial testing to proactively discover vulnerabilities, biases, and failure modes before users do. Think like a malicious actor, confused user, or edge-case generator.
| Scenario | Why Critical |
|---|---|
| Agentic/Flexible Solutions | More pathways for failure (web browsing, code execution, multi-step decisions) |
| Long Conversation Histories | Cumulative errors - small issue in turn 1 amplified by turn 10 |
| High-Risk Domains | Maternal health, medical advice, financial planning - severe impact of failure |
| Population-Scale | Unknown interaction patterns; improbable behaviors will occur at scale |
Plan → Probe → Prioritize
Resources: Red-Teaming AI for Social Good Playbook (UNESCO & Human Intelligence, 2024), Planning Red-Teaming for Large Language Models (Microsoft Learn, 2024)
Question: Does the product facilitate meaningful interactions?
Why important: Beyond evaluating how the AI model performs against key metrics, organizations need to assess how well the product engages real users and whether it solves a meaningful problem for the user. It is unlikely that a product will shift development outcomes if it fails to engage its users. Like model evaluation, this type of evaluation is a continuous and iterative process, rather than one-off.
Technology companies frequently evaluate and improve products by collecting user interaction metrics and then running rapid cycles of digital experiments. For example, they may track a user’s journey on a website, automatically collecting records like which products users click on and whether they return to the site. Then, they can compare how different web or app experiences affect browsing time or user satisfaction.
Unique advantages of digital: This rapid, iterative process is enabled by two factors unique to digital interventions: (1) iterations of the product can be precisely and efficiently deployed to different users, and (2) on-platform engagement outcomes are costless to collect and transform into meaningful engagement metrics.
| Role | Responsibility |
|---|---|
| Product Managers | Execute - Directly responsible for product metrics |
| Data Scientists | Support - Apply evaluation methods |
| Engineers | Support - Build and roll out features |
Resources: AI4GD A/B Testing Playbook, AI4GD User Funnel and Metrics Playbook
| Category | Metric Type | Examples |
|---|---|---|
| Retention | User-Level Retention | DAU/MAU, session count |
| Engagement | Action-Based | Response rate, clicks, rewrites |
| Engagement | Interaction Duration | Session length, conversational turns |
| Engagement | Feature Uptake | Click-through to links, feature use |
| Non-Engagement | Quality Scores | Toxicity score, informativeness |
| Non-Engagement | Item-Level Surveys | “Helpful” ratings, “want more” votes |
| Non-Engagement | User-Level Surveys | Overall satisfaction, usability |
| Non-Engagement | User Control | Topic subscriptions, filtering |
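As an illustration of the retention row, DAU/MAU "stickiness" can be computed from a raw (user, date) event log; the events below are made up:

```python
# Stickiness sketch: daily active users on a given day, divided by the
# distinct users active over the trailing window (default 30 days).
from datetime import date

def stickiness(events, day, window=30):
    """DAU on `day` over distinct users active in the past `window` days."""
    dau = {u for u, d in events if d == day}
    mau = {u for u, d in events if 0 <= (day - d).days < window}
    return len(dau) / len(mau) if mau else 0.0

events = [("a", date(2025, 11, 1)), ("b", date(2025, 11, 1)),
          ("a", date(2025, 11, 20)), ("c", date(2025, 11, 20))]
print(round(stickiness(events, date(2025, 11, 20)), 2))  # -> 0.67
```

In practice these events would be read from the ETL store rather than hard-coded, and tracked as a trend rather than a point estimate.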
Question: Does the product positively support users’ thoughts, feelings, knowledge, and behaviors?
Why important: Once the product functions correctly (Level 1) and engages users (Level 2), ask:
Is it changing how users think, feel, or act?
User psychological and behavioral changes are early indicators of long-term development goals. These evaluations are faster and cheaper than full impact evaluations, allowing rapid iteration.
| Area | Question | Example Constructs |
|---|---|---|
| Cognitive | Are users gaining new knowledge or correcting misconceptions? Do they demonstrate improved skills or decision-making ability as a result of using the AI? | Users’ comprehension, reflection, reasoning, and perceived clarity or understanding during interaction |
| Affective | How does the product make users feel? Do users report feeling supported, motivated, and capable after interactions, or are there indications of frustration, confusion, or emotional distress? | Mood, sense of belonging, perceived empathy, trust, or comfort interacting with AI |
| Behavioral | Are users taking small but meaningful actions (e.g., asking more questions, trying out recommended behaviors) that would predict their longer-term development? | Users’ acquisition, recall, and application of factual or procedural information, and observable behaviors (e.g., asking more questions, trying out recommended behaviors) that are proxies for longer-term development outcomes |
| Role | Responsibility |
|---|---|
| Psychologists, UX Researchers | Execute - Apply evaluation methods |
| Data Scientists | Support - Design A/B tests and experiments |
On-Platform Behavioral Measures:
Short Self-Report Surveys: Validated scales, brief and specific, integrated into flow
NLP and Text Analysis: Sentiment analysis, topic modeling, LIWC, LLM-based analysis
Off-Platform Measures: Longer surveys, observer reports, objective performance data
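As a toy illustration of the text-analysis idea, a lexicon-based sentiment tagger (the word lists are invented; real evaluations would use validated tools like LIWC or an LLM-based analysis):

```python
# Toy lexicon-based sentiment tagger over user messages. The lexicons are
# illustrative stand-ins for a validated sentiment resource.
POSITIVE = {"helpful", "clear", "confident", "thanks"}
NEGATIVE = {"confused", "frustrated", "unclear", "worried"}

def sentiment(message: str) -> str:
    words = set(message.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("thanks, that was clear"))  # -> positive
print(sentiment("I am still confused"))     # -> negative
```

Aggregated over sessions, even crude tags like these can surface affective trends worth following up with validated surveys.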
Question: Does the product improve development outcomes?
Why important: Impact evaluations (IEs) measure effects on outcomes like mortality, learning, and earnings. The challenge: many things happen simultaneously, making simple before-and-after comparisons unreliable.
Solution: Use a counterfactual - a similar sample that didn’t receive the intervention. This captures what would have happened without the intervention, allowing us to isolate the intervention’s impact.
| Method | Description | Best For |
|---|---|---|
| RCT | Random assignment to treatment/control | Most credible; gold standard |
| Propensity Score Matching | Match on observable characteristics | When randomization not possible |
| Difference-in-Differences | Compare trends before/after | When parallel trends assumption holds |
| Regression Discontinuity | Compare units just above/below cutoff | When cutoff exists and is exogenous |
RCTs are the most credible way to determine causal impact. Random assignment ensures differences can be attributed to the intervention, not population differences or external factors.
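The difference-in-differences row reduces to simple arithmetic; the numbers below are invented:

```python
# Worked difference-in-differences sketch: the estimate is
# (treated_after - treated_before) - (control_after - control_before).
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    return (treated_after - treated_before) - (control_after - control_before)

# Hypothetical mean test scores before/after an AI tutor rollout.
effect = diff_in_diff(treated_before=52.0, treated_after=61.0,
                      control_before=51.0, control_after=55.0)
print(effect)  # -> 5.0 points, attributable to the intervention only
               #    under the parallel-trends assumption
```

A real analysis would add standard errors and covariates via a regression, but the identifying logic is exactly this subtraction.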
| Role | Responsibility |
|---|---|
| Policy Researchers, Economists | Execute - Apply evaluation methods |
| AI Engineers | Support - Ensure product functions as expected |
There are three main reasons to do an impact evaluation:
Proof of concept: By isolating the effect of the intervention from the rest of the world, the impact evaluation allows you to causally attribute changes in outcomes to the intervention - giving you proof of concept.
Proof in different settings: Once you know it works in a particular setting, with a particular target population, you may want to show it will work in other settings or for other populations - then you can do additional impact evaluations.
Cost-benefit analysis: For many funders and public sector partners, IEs are critical for decision-making. They want credible evidence that a product meaningfully improves people’s lives - beyond engagement metrics or self-reported satisfaction - before committing to scale. A well-designed IE sends a strong signal that your product works in real-world conditions, and that scaling it is likely to generate meaningful social returns (see e.g. Hauser et al., 2025; UK GOV, 2025).
IEs also help funders compare across opportunities. When paired with cost data, they allow for robust estimates of cost-effectiveness and cost-benefit analysis - crucial when governments, donors, and multilateral institutions are allocating scarce resources. In many cases, the result of an IE becomes a key input in decisions to scale, replicate, or exit.
Important: It is important to be clear on why you are doing the impact evaluation at the outset, as this will affect the data you collect and how you design the evaluation. For example, if you are doing a proof of concept evaluation, you may want to invest more in collecting a rich set of outcomes in your Level 3 evaluations to understand how these map to ultimate welfare outcomes.
IEs are high-investment undertakings, both financially (they often cost millions of dollars) and operationally (service providers have to adapt their operations, often in challenging ways, to make them work). They are most useful when your product is mature enough to test and when the decision stakes are high enough to justify the effort.
In general, consider an IE when:
You do NOT need to run an IE if your product is still in early design, or if usage is so inconsistent that detecting impacts is unlikely. In those cases, Level 3 evaluations can be more appropriate.
Although impact evaluations are typically conducted at later stages, designing credible and cost-effective IEs often requires thinking about design decisions far earlier in the process. Incorporating features like holdout groups, staged rollouts, or embedded randomization into the initial product architecture (which could also be useful for A/B tests) ensures that rigorous causal evaluation remains possible - without requiring disruptive redesigns later on.
Even if a full IE is not yet justified, these design choices create structured opportunities for credible inference when the time comes and can significantly reduce the burden of evaluation. Funders assessing scale readiness should look for these signals of early evaluability.
Rigorous IEs require expertise. We recommend working with an independent evaluator - such as an academic partner, a research NGO (e.g., J-PAL, IPA), or a third-party M&E firm. This enhances both the technical quality and the perceived credibility/independence of your evaluation.
At a minimum, we suggest:
Resources: Impact Evaluation in Practice (Gertler et al., World Bank), Running Randomized Evaluations (Glennerster & Takavarasha)
Focus: What is distinctive when evaluating AI-based products
Options:
Pure control: No intervention at all
Business-as-usual: No digital support or sporadic human guidance
Non-AI digital tools: Static chatbots or curated content
Human-delivered services: When AI substitutes for scarce labor (measure costs: Cost Measurement Guide)
Important: Justify selection and explain what it helps illuminate
Critical: Marginal benefit depends on what other support users already have
Measure:
Existing technology use (frequency, type, purpose)
What users rely on today (informal networks, human advisors, basic tech)
Leakage - how much control group has access to intervention
Why: Shapes incremental value added by AI product
Challenge: SUTVA assumption (same version for all treated units) often violated
AI products: Designed to improve iteratively - different participants may interact with different versions
Solutions:
Tag your versions
If A/B testing, randomize test participation
Maintain hold-out group on baseline version
Pre-specify at high level (not overly detailed)
Challenge: AI tools often simulate expertise - does user learn or just copy?
Solutions:
Use industry-standard validated assessments
Use administrative data
Avoid measures that can be gamed by repeating AI output
Test ability when users don’t have access to AI
Challenge: AI tools designed for scale - freely accessible, easy to share
Strategies:
Controlled access: Individual or cluster assignment
Publicly available: Randomized encouragement design
High contamination risk: Run in settings with low existing exposure
Cluster randomization: By school or clinic
Monitor usage: Be prepared to adjust power calculations
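The cluster-randomization strategy can be sketched as a deterministic assignment (cluster names are hypothetical):

```python
# Cluster randomization sketch: every user in a cluster (e.g., a school or
# clinic) shares one arm, limiting contamination between arms. A fixed seed
# makes the assignment reproducible and auditable.
import random

def assign_clusters(clusters, seed=42):
    """Shuffle clusters reproducibly and split them half/half into arms."""
    rng = random.Random(seed)
    shuffled = sorted(clusters)   # canonical order before seeded shuffle
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {c: ("treatment" if i < half else "control")
            for i, c in enumerate(shuffled)}

arms = assign_clusters(["school_%02d" % i for i in range(10)])
print(sum(a == "treatment" for a in arms.values()))  # -> 5
```

Randomizing at the cluster level costs statistical power relative to individual assignment, which is why monitoring usage and revisiting power calculations matters.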
Level 1 - Model: Does the AI model produce desired responses? - Rubric → Metrics → Golden Dataset → Automated Evals → Refine → Red-team
Level 2 - Product: Does the product facilitate meaningful interactions? - A/B tests, engagement metrics, retention, feature uptake
Level 3 - User: Does product support thoughts, feelings, knowledge, behaviors? - Cognitive, affective, behavioral outcomes via surveys, NLP, behavioral measures
Level 4 - Impact: Does access improve development outcomes? - RCTs and other methods to measure causal impact on mortality, learning, earnings
Level 1 Case Studies:
- Generative AI for Health in Low & Middle Income Countries
- Evaluation framework of PROMPTS at Jacaranda Health
- Evaluation framework at Precision Development (slides)
- Evaluation of Farmer.Chat at Digital Green
- Evaluation of mMitra at Armman
Level 3 Case Studies:
- ChatSEL Documentation
- User Evaluation Workshop - ChatSEL
Authored by:
- The Agency Fund
- IDInsight
- Center for Global Development
Please reach out to Zezhen Wu for questions or comments.
This is a living playbook. Current version grounded in lessons from AI4GD accelerator teams and experts across disciplines.